Principles of Unconstrained Minimization
MATH008 Lesson 9
00:00
We transition from the theoretical existence of a minimum to the algorithmic engine of optimization. Our core objective is to minimize $f(x)$ (9.1) where $f: \mathbf{R}^n \to \mathbf{R}$ is convex and twice continuously differentiable. Since $f$ is differentiable and convex, a necessary and sufficient condition for a point $x^*$ to be optimal is $\nabla f(x^*) = 0$.
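As a concrete illustration of the optimality condition (a standard least-squares example, not taken from the lesson): setting $\nabla f(x) = 0$ for $f(x) = \|Ax - b\|_2^2$ yields the normal equations $A^T A x^* = A^T b$. A minimal numpy sketch:

```python
import numpy as np

# Standard least-squares illustration (not from the lesson): f(x) = ||Ax - b||_2^2
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

# grad f(x) = 2 A^T (A x - b) = 0  =>  normal equations  A^T A x* = A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)

print(np.linalg.norm(2 * A.T @ (A @ x_star - b)))   # ~0: first-order condition holds
```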

The Algorithmic Framework

Numerical solutions construct a minimizing sequence: a sequence of points $x^{(0)}, x^{(1)}, \dots \in \text{dom } f$ with $f(x^{(k)}) \to p^*$ as $k \to \infty$. Each step updates the position via $x^{(k+1)} = x^{(k)} + t^{(k)}\Delta x^{(k)}$, where $t^{(k)} > 0$ is the step size (chosen, e.g., by exact or backtracking line search) and $\Delta x^{(k)}$ is a descent direction, i.e., $\nabla f(x^{(k)})^T \Delta x^{(k)} < 0$ whenever $x^{(k)}$ is not optimal.
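A minimal sketch of this general descent loop, using the negative gradient as the descent direction and a backtracking line search; the parameters alpha, beta, the tolerance, and the quadratic test problem are illustrative assumptions, not prescribed by the lesson:

```python
import numpy as np

def descent(f, grad, x0, alpha=0.3, beta=0.7, tol=1e-8, max_iter=500):
    """Generic descent method: repeat x <- x + t * dx with a backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:             # stop when the gradient is (nearly) zero
            break
        dx = -g                                  # descent direction: negative gradient
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * g @ dx:
            t *= beta                            # backtrack until sufficient decrease holds
        x = x + t * dx
    return x

# Illustrative strongly convex quadratic: f(x) = 0.5 x^T Q x - b^T x
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_min = descent(lambda x: 0.5 * x @ Q @ x - b @ x, lambda x: Q @ x - b, np.zeros(2))
print(x_min, np.linalg.solve(Q, b))              # the two should (nearly) agree
```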

Initialization & Convergence

The methods described in this chapter require a suitable starting point $x^{(0)}$. The starting point must lie in $\text{dom } f$, and in addition the sublevel set $S = \{x \in \text{dom } f \mid f(x) \le f(x^{(0)})\}$ must be closed. Since a descent method only produces iterates with $f(x^{(k)}) \le f(x^{(0)})$, this guarantees the whole sequence stays in the closed set $S \subseteq \text{dom } f$, a well-behaved region on which the convergence analysis (via bounds on the Hessian) applies; the condition holds for any starting point when $f$ itself is closed, i.e., when all of its sublevel sets are closed.

Descent Directions

The simplest direction is $\Delta x = -\nabla f(x)$ (gradient descent). However, efficiency often requires accounting for the geometry of the problem: for a norm $\|\cdot\|$, the normalized steepest descent direction is $\Delta x_{\text{nsd}} = \operatorname{argmin}\{\nabla f(x)^T v \mid \|v\| \le 1\}$, and the unnormalized direction is $\Delta x_{\text{sd}} = \|\nabla f(x)\|_* \, \Delta x_{\text{nsd}}$, where $\|\cdot\|_*$ is the dual norm. Two important cases follow (a quadratic-norm sketch appears after this list):

  • Quadratic Norm: $\|z\|_P = (z^T P z)^{1/2} = \|P^{1/2}z\|_2$ with $P \succ 0$, giving $\Delta x_{\text{sd}} = -P^{-1}\nabla f(x)$ (gradient descent in the changed coordinates $\bar{x} = P^{1/2}x$).
  • $\ell_1$ Norm: $\Delta x_{\text{sd}} = \Delta x_{\text{nsd}} \|\nabla f(x)\|_\infty = - \frac{\partial f(x)}{\partial x_i} e_i$, where $i$ is an index with $|\nabla f(x)_i| = \|\nabla f(x)\|_\infty$ (the dual of $\ell_1$ is $\ell_\infty$); each step updates a single coordinate, i.e., coordinate descent.
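A minimal sketch of the quadratic-norm case: the steepest descent direction is obtained by solving $P\,\Delta x_{\text{sd}} = -\nabla f(x)$. The matrices and the choice $P \approx \nabla^2 f$ below are illustrative assumptions, not from the lesson:

```python
import numpy as np

def quadratic_norm_sd(grad_x, P):
    """Steepest descent direction for the quadratic norm ||z||_P: dx_sd = -P^{-1} grad f(x)."""
    return np.linalg.solve(P, -grad_x)

# Illustrative badly scaled quadratic f(x) = 0.5 x^T H x
H = np.diag([100.0, 1.0])
grad = lambda x: H @ x

x = np.array([1.0, 1.0])
print(-grad(x))                        # plain gradient direction: dominated by the steep coordinate
print(quadratic_norm_sd(grad(x), H))   # with P = H the direction points straight at the minimizer
```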

Second-Order Models and Trust Regions

Newton's method uses a second-order Taylor approximation: $$\hat{f}(x+v) = f(x) + \nabla f(x)^T v + \frac{1}{2} v^T \nabla^2 f(x) v$$ This quadratic is minimized at $v = \Delta x_{\text{nt}} = -\nabla^2 f(x)^{-1}\nabla f(x)$ (the Newton step). We define a trust region: the set $\{v \mid \|v\|_2 \le \gamma\}$, where the parameter $\gamma$ reflects our confidence in the second-order model. In practice the Newton step is computed by solving $\nabla^2 f(x)\, v = -g$ with $g = \nabla f(x)$, typically via a Cholesky factorization $\nabla^2 f(x) = LL^T$: forward-substitute $Lw = -g$, then back-substitute to obtain $v = L^{-T}w = -L^{-T}L^{-1}g$ (the same factorization strategy, with a fill-reducing permutation $P$ for sparse problems, is used when solving KKT systems).
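A minimal sketch of computing the Newton step via a Cholesky factorization, using SciPy's cho_factor/cho_solve; the quadratic test problem is an illustrative assumption:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_step(grad_x, hess_x):
    """Newton step: solve hess(x) dx = -grad(x) via Cholesky (hess must be positive definite)."""
    c, low = cho_factor(hess_x)            # hess = L L^T
    return cho_solve((c, low), -grad_x)    # forward substitution, then back substitution

# Illustrative quadratic f(x) = 0.5 x^T Q x - b^T x: one Newton step reaches the minimizer
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = np.array([5.0, -5.0])
dx = newton_step(Q @ x - b, Q)             # grad = Qx - b, hess = Q
print(x + dx, np.linalg.solve(Q, b))       # the two agree
```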

🎯 Core Principles of Convergence
Efficiency is measured by the speed at which the error $f(x^{(k)}) - p^*$ vanishes. For strongly convex functions, the error converges to zero at least as fast as a geometric series: $f(x^{(k)}) - p^* \le c^k\,(f(x^{(0)}) - p^*)$ for some $c \in (0,1)$. In the context of iterative numerical methods, this is called linear convergence.
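A small numerical illustration of this geometric decay, using gradient descent with a fixed step size on a strongly convex quadratic; the problem data and step size are illustrative assumptions, not from the lesson:

```python
import numpy as np

# Strongly convex quadratic: f(x) = 0.5 x^T Q x, with minimizer x* = 0 and p* = 0
Q = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ Q @ x

x = np.array([1.0, 1.0])
t = 1.0 / 10.0                       # fixed step size 1/L, L = largest eigenvalue of Q
errors = []
for k in range(30):
    errors.append(f(x))              # f(x^{(k)}) - p*   (p* = 0 here)
    x = x - t * (Q @ x)              # gradient descent step

ratios = np.array(errors[1:]) / np.array(errors[:-1])
print(ratios.max())                  # stays below some c < 1: geometric (linear) convergence
```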
  • Suboptimality Bound: for strictly convex self-concordant $f$, $p^* \geq f(x) + \lambda(x) + \log(1 - \lambda(x))$, valid if the Newton decrement $\lambda(x) = \left(\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\right)^{1/2}$ satisfies $\lambda(x) < 1$.
  • Self-concordance Sum: If $f_1, f_2$ are self-concordant, then $f_1 + f_2$ is self-concordant.
  • Hessian Sparsity: substantial efficiency is gained when the Hessian is banded, i.e., $\nabla^2 f(x)_{ij} = 0$ for $|i-j| > k$; the Newton system can then be factored in roughly $nk^2$ flops rather than the $(1/3)n^3$ of a dense Cholesky factorization (a banded-solve sketch follows this list).
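A minimal sketch of exploiting a banded Hessian in the Newton system, using SciPy's banded symmetric solver; the tridiagonal Hessian and gradient below are illustrative assumptions, not from the lesson:

```python
import numpy as np
from scipy.linalg import solveh_banded

# Illustrative tridiagonal (bandwidth k = 1) positive definite Hessian of size n
n = 6
diag = np.full(n, 2.0)
off = np.full(n - 1, -1.0)
g = np.ones(n)                           # stand-in gradient

# Upper banded storage for solveh_banded: row 0 = superdiagonal, row 1 = main diagonal
ab = np.zeros((2, n))
ab[0, 1:] = off
ab[1, :] = diag

dx_banded = solveh_banded(ab, -g)        # ~n k^2 work instead of ~n^3 for a dense solve

# Check against the dense solve
H = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
print(np.allclose(dx_banded, np.linalg.solve(H, -g)))  # True
```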